In Sections 12.1 - 12.3 we saw how supervised and unsupervised learners alike can be extended to perform nonlinear learning via arbitrary linear combinations of nonlinear functions / feature transformations. However, the general problem of engineering an appropriate nonlinearity for a given dataset has thus far remained elusive. In this Section we introduce the first of two major tools for handling this task: universal approximators. Universal approximators are families of simple nonlinear feature transformations whose members can be combined to create arbitrarily complex nonlinearities - as complex as any we would ever expect to find in a supervised or unsupervised learning dataset. Here we also introduce the three standard types of universal approximators employed in practice today - kernels, neural networks, and trees - of which we will have much more to say in future Chapters.
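Before turning to concrete examples, here is a minimal sketch of the underlying idea: a nonlinear model is just a weighted linear combination of simple nonlinear units. The function names and parameter values below are illustrative only - they are not part of the custom library used in this notebook - and we use single layer tanh units as the example family.

```python
import numpy as np

def tanh_unit(x, a, b):
    # a single nonlinear 'unit': an internally parameterized tanh function
    return np.tanh(a * x + b)

def model(x, w0, w, a, b):
    # linear combination of B tanh units: w0 + sum_j w_j * tanh(a_j * x + b_j)
    return w0 + sum(wj * tanh_unit(x, aj, bj) for wj, aj, bj in zip(w, a, b))

# evaluate a tiny 3-unit model at a few input points
x = np.linspace(-1, 1, 5)
y = model(x, 0.5, [1.0, -2.0, 0.5], [3.0, 1.0, -2.0], [0.0, 0.5, -0.5])
print(y.round(2))
```

Tuning such a model simply means adjusting all of the weights (here `w0`, `w`, `a`, and `b`) to the data, e.g., by gradient descent on a cost function.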
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
# plotting tools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
In reality we virtually never have 'perfect' datasets to learn on, and must make do with what we have. However, the same principle we employ in the 'perfect' data scenario - using members of a family of universal approximators to uncover the nonlinearity of a dataset - still makes sense here. For example, below we animate the fitting of a set of 100 single layer tanh neural network units to a realistic regression dataset that looks something like the first 'perfect' dataset shown in the previous Subsection. Here we show the resulting fits from $100$ evenly sampled weights taken from a run of $5000$ gradient descent steps used to minimize the corresponding regression Least Squares cost function. As you move the slider from left to right you can track which step of gradient descent provides the weights for the current fit by noting where the red dot is located on the cost function history plot in the right panel. As you pull the slider from left to right, using more and more refined weights, the resulting fit gets better.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + 'universal_regression_samples_0.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib5 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib5.choose_features(name = 'multilayer_perceptron',layer_sizes = [1,100,1],activation = 'tanh')
# choose normalizer
mylib5.choose_normalizer(name = 'standard')
# choose cost
mylib5.choose_cost(name = 'least_squares')
# fit an optimization
mylib5.fit(max_its = 5000,alpha_choice = 10**(-1))
# load up animator
demo5 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 100 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo5.animate_1d_regression(mylib5,num_frames,scatter = 'points',show_history = True)
Below we repeat the experiment above, only here we use $50$ stump units, tuning them to the data using $5000$ gradient descent steps. Once again, as you move the slider the fit resulting from a particular step of gradient descent - reflected on the cost function history - is shown, and as you move from left to right the run progresses and the fit gets better.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + 'universal_regression_samples_0.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib6 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib6.choose_features(name = 'stumps')
# choose normalizer
mylib6.choose_normalizer(name = 'none')
# choose cost
mylib6.choose_cost(name = 'least_squares')
# fit an optimization
mylib6.fit(max_its = 5000,alpha_choice = 10**(-2))
# load up animator
demo6 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 100 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo6.animate_1d_regression(mylib6,num_frames,scatter = 'points',show_history = True)
This same phenomenon holds if we perform any other sort of learning - like classification. Below we use a set of stumps to perform two-class classification, trained via gradient descent, on a realistic dataset reminiscent of the 'perfect' three dimensional classification dataset shown in the previous Subsection. As you pull the slider from left to right the tree-based model employs weights from further along in the optimization run, and the fit gets better.
## This code cell will not be shown in the HTML version of this notebook
# load in data
csvname = datapath + '2eggs_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib7 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib7.choose_features(name = 'stumps')
# choose normalizer
mylib7.choose_normalizer(name = 'none')
# choose cost
mylib7.choose_cost(name = 'softmax')
# fit an optimization
mylib7.fit(max_its = 1000,alpha_choice = 10**(-2))
# plot
demo7 = nonlib.run_animators.Visualizer(datapath + '2eggs_data.csv')
frames = 10
demo7.animate_static_N2_simple(mylib7,frames,show_history = False,scatter = 'on',view = [30,-50])
This sort of trend holds for multiclass classification (and unsupervised learning problems as well), as illustrated in the example below. Here we have tuned $100$ single layer tanh neural network units by minimizing the multiclass softmax cost to fit a toy $C=3$ class dataset. Moving the slider from left to right steps through weights from a run of $10,000$ gradient descent steps, with weights from later in the run used as the slider advances.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + '3_layercake_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib8 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib8.choose_features(name = 'multilayer_perceptron',layer_sizes = [2,100,3],activation = 'tanh')
# choose normalizer
mylib8.choose_normalizer(name = 'standard')
# choose cost
mylib8.choose_cost(name = 'multiclass_softmax')
# fit an optimization
mylib8.fit(max_its = 10000,alpha_choice = 10**(-1))
# plot cost history
mylib8.show_histories(start = 10)
# load up animator
demo8 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 30 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo8.multiclass_animator(mylib8,num_frames,scatter = 'points',show_history = True)
However, there is one very distinct difference between the 'perfect' and real data cases in terms of how we employ universal approximators to correctly determine the amount of nonlinearity present in a dataset: with real data we can tune the parameters of a model employing universal approximators too well, use too many universal approximators, and/or use universal approximators that are too nonlinear for the dataset given. In short, the model we use (a linear combination of universal approximators) can be too nonlinear for a real dataset.
For example, below we animate the fit provided by a large number of polynomial units to the real regression dataset shown in the first two examples of this Subsection. Here we progressively fit more and more polynomial units to this dataset, displaying the resulting fit and corresponding Least Squares error provided by the nonlinear model. As you move the slider from left to right you can see the result of fitting each successive polynomial model to the dataset, with the number of polynomial units in the model displayed over the left panel (where the data and corresponding fit are shown). In the right panel we show the Least Squares error - or cost function value - of this model. As you can see by moving the slider from left to right, adding more polynomial units always decreases the cost function value (just as in the 'perfect' data case); however the resulting fit - after a certain point - actually gets worse. It is not that the model fits the training data worse as it becomes more flexible; rather, after a certain number of universal approximators are used (here around 15) the tuned model clearly becomes too nonlinear for the phenomenon at hand, and hence becomes a poor model for future test data.
## This code cell will not be shown in the HTML version of this notebook
# load in nonlinear regression demo and run over range of units
demo10 = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'universal_regression_samples_0.csv'
demo10.load_data(csvname)
demo10.brows_single_fit(basis='poly',num_units = [v for v in range(1,155,1)])
This sort of phenomenon is a problem regardless of the sort of universal approximator we use - whether it be a kernel, neural network, or tree-based catalog of functions. As another example, below we animate the fitting of $1$ through $20$ polynomial units (left panel), single layer tanh neural network units (middle panel), and stump units (right panel) to the simple sinusoidal regression dataset we have used previously - e.g., in the first example of Section 12.1. As you move the slider from left to right you will see the fit resulting from the use of more and more of each type of unit. As you continue to add units, in each case the result indeed fits the training data better, but after a certain point - for each type of universal approximator - the fit clearly becomes poor for future test data.
## This code cell will not be shown in the HTML version of this notebook
# run comparison demo for regression using all three main catalogs of universal approximators
demo11 = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo11.load_data(csvname)
demo11.brows_fits(num_elements = [v for v in range(1,20,1)])
This same problem presents itself with all real supervised / unsupervised learning datasets. For example, if we take the two-class classification dataset shown in the third example of this Subsection and more completely tune the parameters of the same set of stumps, we learn a model that - while fitting the training data we currently have even better than before - is far too flexible for future test data. Moving the slider one notch to the right shows the result of a (nearly completely) optimized set of stumps trained to this dataset, with the resulting fit being extremely nonlinear (far too nonlinear for the phenomenon at hand).
## This code cell will not be shown in the HTML version of this notebook
# load in data
csvname = datapath + '2eggs_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib12 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib12.choose_features(name = 'stumps')
# choose normalizer
mylib12.choose_normalizer(name = 'none')
# choose cost
mylib12.choose_cost(name = 'softmax')
# fit an optimization
mylib12.fit(optimizer = 'newtons method',max_its = 1)
# plot
demo12 = nonlib.run_animators.Visualizer(datapath + '2eggs_data.csv')
frames = 2
demo12.animate_static_N2_simple(mylib12,frames,show_history = False,scatter = 'on',view = [30,-50])
As with regression, this sort of phenomenon can occur regardless of the sort of universal approximator we use. For example, below we show the successive fitting of a few degree $D$ polynomials, with $D$ ranging from $1$ to $50$, to the same dataset. While the cost function value / fit to the training data indeed decreases with each successive polynomial, as you can see the fit - after a certain point - becomes far too nonlinear.
## This code cell will not be shown in the HTML version of this notebook
# run animator for two-class classification fits
csvname = datapath + '2eggs_data.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)
# run animator
demo.brows_single_fits(num_units = [v for v in range(0,50,10)], basis = 'poly',view = [30,-80])
In the jargon of machine learning / deep learning, the amount of nonlinearity - or nonlinear potential - a model has is commonly referred to as the model's capacity. With real data in practice we need to make sure our trained model has neither too little capacity (is not too rigid) nor too much capacity (is not too flexible). In the jargon of our trade this desire - to get the capacity just right - often goes by the name of the bias-variance trade-off. A model with too little capacity is said to underfit the data, or to have high bias. Conversely, a model with too much capacity is said to overfit the data, or to have high variance.
Phrased in these terms, with real data we want to tune the capacity of our model 'just right' so as to resolve this bias-variance trade-off, i.e., so that our model neither underfits the data (high bias) nor overfits it (high variance).
With perfect data - where we have (close to) infinitely many training data points that perfectly describe a phenomenon - we have seen that we can always determine the appropriate nonlinearity by increasing the capacity of our model. Doing this consistently decreases the error of the model on the training dataset while improving how the model represents the (training) data.
However with real data we saw that the situation is more delicate. It is still true that by increasing a model's capacity we decrease its error on the training data, and this does improve its ability to represent our training data. But because our training data is not perfect - we usually have only a subsample of (noisy examples of) the true phenomenon - this becomes problematic when the model begins overfitting. Beyond a certain capacity the model starts representing our training data too well, and becomes a poor prediction tool for future input.
The problem here is that nothing about the training error tells us when a model begins to overfit a training dataset. The phenomenon of overfitting is simply not reflected in the training error measurement. In other words, training error is the wrong measurement tool for determining the proper capacity of a model. If we are searching through a set of models for the one with the very best amount of capacity (when properly tuned) for a given dataset, we cannot determine which one is 'best' by relying on training error. We need a different measurement tool to help us determine the proper amount of nonlinearity a model should have with real data.
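To make this concrete, here is a minimal sketch of the phenomenon using synthetic noisy sinusoidal data and numpy's generic polynomial tools (not the custom library or datasets of this notebook). Training error decreases monotonically as we increase capacity (polynomial degree), while error on held-out data is minimized at a moderate capacity - so the training error alone cannot tell us where overfitting begins.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 'real' data: noisy samples of one period of a sinusoid
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(30)
x_test = np.sort(rng.uniform(0.0, 1.0, 30))
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(30)

def mse(coeffs, x, y):
    # mean squared error of a polynomial model on a dataset
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

degrees = list(range(1, 13))
train_errs, test_errs = [], []
for D in degrees:
    coeffs = np.polyfit(x_train, y_train, D)  # tune a degree-D polynomial model
    train_errs.append(mse(coeffs, x_train, y_train))
    test_errs.append(mse(coeffs, x_test, y_test))

# training error keeps shrinking with capacity, but the capacity that best
# serves future (held-out) data is found by looking at the test error instead
print('best degree by training error:', degrees[int(np.argmin(train_errs))])
print('best degree by test error:', degrees[int(np.argmin(test_errs))])
```

This is precisely the motivation for the validation-based measurement tools discussed later: the minimum of the held-out error, not the training error, identifies an appropriate capacity.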
Notice that in the examples here, when constructing a model with universal approximator feature transformations we always use a single kind of universal approximator per model. That is, we do not mix exemplars from different universal approximator families - using, e.g., a few polynomial units and a few tree units in the same model. This is done for several reasons. First and foremost - as we will see in the Chapters following this one (with one Chapter dedicated to additional technical details relating to each universal approximator family) - by restricting a model's feature transformations to a single family we can (in each of the three cases) better manage our search for a model with the proper capacity for a given dataset, optimize the learning process, and better deal with each family's unique eccentricities.
However, it is quite commonplace to fit a set of models - each employing a single family of universal approximators - to a dataset, and then combine or ensemble the fully trained models. We will discuss this further later on in this Chapter.
© This material is not to be distributed, copied, or reused without written permission from the authors.